;;; -*- Mode: TEXT -*-
;;; File: AutoClass:doc;ac2-vs-ac3.text
;;;————————————————————————-;;;
;;; AUTOCLASS 3.0 Released 5/11/90 contact: Taylor@pluto.arc.nasa.gov ;;;
;;; by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor ;;;
;;; NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035 ;;;
;;; ;;;
;;; Copyright (C) 1990 Research Institute for Advanced Computer Science. ;;;
;;; All rights reserved. The RIACS Software Policy contains specific ;;;
;;; terms and conditions on the use of this software, and must be ;;;
;;; distributed with any copies. THIS FILE MAY BE REDISTRIBUTED. This ;;;
;;; copyright and notice must be preserved in all copies made of this file.;;;
;;;————————————————————————-;;;
;;; added 6/06/90 for 3.0.2
AutoClass-2 was built around the simplest useful class probability
function. It modeled each discrete attribute separately with a
multinomial distribution and each real attribute with a normal
distribution. Attributes were assumed to be independent of one another
within a class, thus ignoring the possibility of joint discrete and
covariant real distributions. For both attribute types, missing values
were allowed for by conditioning the basic distribution on a binomial
distribution over the meta-values of `known' and `unknown'.
Attributes could also be ignored. AutoClass-3.0 uses essentially
identical class probability functions.
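
The class probability function described above can be sketched in a few
lines. This is only an illustration in Python, not the AutoClass code: the
names log_normal, log_prob_given_class and example_class, the dictionary
layout, and all parameter values are assumptions made for the example. A
value of None stands for `unknown', and a term of None stands for an
ignored attribute.

```python
import math

def log_normal(x, mean, sigma):
    """Log-density of a real value under a normal distribution."""
    z = (x - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_prob_given_class(instance, class_terms):
    """Log-probability of one instance given one class.

    Attributes are treated as independent within the class, so the
    per-attribute log-probabilities are simply added. Each term carries
    p_known, the probability that the value is present at all; it should
    lie strictly between 0 and 1 if missing values can occur.
    """
    total = 0.0
    for value, term in zip(instance, class_terms):
        if term is None:                      # attribute ignored
            continue
        if value is None:                     # meta-value `unknown'
            total += math.log(1.0 - term["p_known"])
            continue
        total += math.log(term["p_known"])    # meta-value `known'
        if term["type"] == "discrete":        # multinomial term
            total += math.log(term["probs"][value])
        else:                                 # normal term
            total += log_normal(value, term["mean"], term["sigma"])
    return total

# Hypothetical parameters for one class over three attributes.
example_class = [
    {"type": "discrete", "p_known": 0.9, "probs": {"red": 0.5, "blue": 0.5}},
    {"type": "real", "p_known": 0.8, "mean": 1.0, "sigma": 1.0},
    None,
]
print(log_prob_given_class(("red", 1.0, "anything"), example_class))
```

Because the attributes are independent within a class, no term in the sum
ever looks at more than one attribute. That is the restriction that joint
discrete or covariant real distributions would remove.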
The main difference between the two is in the way the probability
function is implemented. The AutoClass-2 function is built into the
internal search representation. In AutoClass-3 a probability function
is implemented as a structure called a model. The model holds all of
the functions and data specific to the application of the probability
function to a particular data set. It is built at runtime from a
user-supplied model specification. The specification lists the types of the
model function terms and the attributes to which they apply. The model
terms define the independent probability distributions applicable to
single or multiple attributes of appropriate type. A model term is
implemented as a set of functions and data structures that are called
from or copied into the model. Thus we obtain great flexibility in
defining specific probability models and in extending the range of such
models.
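
To make the idea of a model built from a specification concrete, here is a
minimal sketch in Python. It does not reproduce the AutoClass-3 structures
or its specification syntax: TERM_TYPES, Term, Model and build_model are
hypothetical names, and the three term types shown are chosen only to
mirror the distributions mentioned above.

```python
from dataclasses import dataclass, field

# Hypothetical registry: term type name -> attribute type it accepts.
# None means the term accepts an attribute of any type.
TERM_TYPES = {
    "multinomial": "discrete",
    "normal": "real",
    "ignore": None,
}

@dataclass
class Term:
    kind: str                 # term type, a key of TERM_TYPES
    attributes: tuple         # indices of the attributes the term covers
    params: dict = field(default_factory=dict)   # filled in during search

@dataclass
class Model:
    terms: list

def build_model(spec, attribute_types):
    """Build a Model from a spec given as (term type, attribute indices) pairs."""
    terms = []
    covered = set()
    for kind, attrs in spec:
        if kind not in TERM_TYPES:
            raise ValueError(f"unknown term type: {kind}")
        wanted = TERM_TYPES[kind]
        for a in attrs:
            if a in covered:
                raise ValueError(f"attribute {a} appears in more than one term")
            if wanted is not None and attribute_types[a] != wanted:
                raise ValueError(f"term {kind} cannot model attribute {a}")
            covered.add(a)
        terms.append(Term(kind, tuple(attrs)))
    return Model(terms)

# Example: three attributes, the last one ignored.
model = build_model(
    [("multinomial", [0]), ("normal", [1]), ("ignore", [2])],
    ["discrete", "real", "real"],
)
print([t.kind for t in model.terms])
```

In a layout like this, adding a new kind of distribution means registering
one more term type. The code that builds and walks the model does not
change.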
There has been a considerable increase in the flexibility of input data
formatting. Data vectors can be given in vector, list, or line mode.
Discrete attribute values no longer need to be translated to a zero-based
integer sequence. Any set of symbols, including strings, may be used.
There is also provision for specifying output translations.
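
The translation that the user was formerly required to do by hand can be
pictured as follows. This Python sketch is an assumption about the general
technique, not a description of how AutoClass-3 stores its translations;
build_translation, encode and decode are hypothetical names.

```python
def build_translation(values):
    """Assign each distinct symbol a zero-based integer, in order of first appearance."""
    to_index = {}
    for v in values:
        if v not in to_index:
            to_index[v] = len(to_index)
    to_symbol = list(to_index)    # dicts keep insertion order, so index i -> symbol
    return to_index, to_symbol

def encode(values, to_index):
    """Translate input symbols to the internal integer sequence."""
    return [to_index[v] for v in values]

def decode(indices, to_symbol):
    """Translate internal integers back to symbols for output."""
    return [to_symbol[i] for i in indices]

column = ["red", "blue", "red", "green"]
to_index, to_symbol = build_translation(column)
print(encode(column, to_index))
print(decode(encode(column, to_index), to_symbol))
```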
There has been some increase in operating flexibility. Classifications,
databases and models are now implemented as structures. Thus one can
simultaneously work with multiple classifications of single or
multiple databases. There is also a variety of initialization and
search methods that have evolved from our own experiments. For most
users, the most useful change is the standard search function named
AutoClass-Search. This has evolved from Robin Hanson's experiments on
search efficiency and estimation of the optimal starting number of
classes. Robin has found that by using the simplest (and fastest)
initialization and convergence methods while developing an estimate of
the optimal class number, one can get very good classifications more
quickly than by any other of our methods. This has been implemented in
the function AutoClass-Search, which offers:
Automatic search for the best number of classes.
Runtime reports that estimate the rate of progress.
More flexible choices for what to save to disk, how often, etc.
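
The general shape of such a search can be sketched as a loop over trial
classifications that keeps the best one found and reports progress as it
goes. The Python below is a deliberately simplified illustration, not the
AutoClass-Search algorithm: it cycles through a fixed list of starting
class numbers instead of estimating the optimal one, and search,
try_classification and toy_trial are hypothetical names. The scoring
function is a toy stand-in for a real classification run.

```python
def search(try_classification, start_choices, n_trials, report=print):
    """Run n_trials trial classifications and return the best (score, n_classes).

    try_classification(j_start) is assumed to run one quick classification
    starting from j_start classes and to return (score, n_classes_found),
    where a higher score is better.
    """
    best = None
    for trial in range(1, n_trials + 1):
        # Cycle through the candidate starting numbers of classes.
        j_start = start_choices[(trial - 1) % len(start_choices)]
        score, j_found = try_classification(j_start)
        if best is None or score > best[0]:
            best = (score, j_found)
        report(f"trial {trial}/{n_trials}: start={j_start} "
               f"found={j_found} score={score:.2f} best={best[0]:.2f}")
    return best

def toy_trial(j_start):
    """Toy stand-in: pretend that three classes fit the data best."""
    return -(j_start - 3) ** 2, j_start

best = search(toy_trial, [2, 3, 5, 8], n_trials=4)
print(best)
```

A real driver would also decide between trials what to save to disk and
how often. That is omitted here.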